
Conversation


@Julien-Ben Julien-Ben commented Oct 2, 2025

Summary

This test has been broken for a while: because there were two functions named test_tls_is_disabled_and_scaled_up, only one of them ever ran.

I think this scenario is unlikely to occur in production, which is why we never had a ticket related to this bug.

On top of that, the test performed the update in two separate steps, whereas to exercise this behaviour both changes must be made in a single update.

I uncovered it as part of the larger refactoring of the controller, but to keep the PR scope reasonable I extracted the related changes, as they are self-contained.
The bug may exist in multi-cluster as well; it was first discovered in 2021: https://jira.mongodb.org/browse/CLOUDP-80768

The blocking mechanism will be implemented in a better way after the multi-cluster-first refactor, since we will then keep track of a global reconciler state, notably holding the target number of replicas for the current reconciliation.
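
As a rough sketch of that future mechanism (all names here are illustrative assumptions, not the operator's actual types), the per-reconciliation state could hold both the current and target member counts and decide once whether scaling must be deferred:

```go
package main

import "fmt"

// ReconcileState is a hypothetical snapshot computed once at the start of
// Reconcile and passed down, so every step agrees on the member count.
type ReconcileState struct {
	CurrentMembers int  // members currently in the automation config
	TargetMembers  int  // members requested in the spec
	TLSDisabling   bool // TLS is enabled now but disabled in the new spec
}

// EffectiveMembers locks scaling while TLS is being disabled: the member
// count is held at the current value until the TLS change has rolled out.
func (s ReconcileState) EffectiveMembers() int {
	if s.TLSDisabling {
		return s.CurrentMembers
	}
	return s.TargetMembers
}

func main() {
	s := ReconcileState{CurrentMembers: 3, TargetMembers: 5, TLSDisabling: true}
	fmt.Println(s.EffectiveMembers()) // scaling deferred while TLS is disabled
	s.TLSDisabling = false
	fmt.Println(s.EffectiveMembers()) // target applied afterwards
}
```

Computing this once at the top of the loop would mean the StatefulSet and the OM automation config can never disagree on the member count.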

Proof of Work

In commit 1d8847e, which only fixes the e2e test, the test fails (evg task), showing that the reconciler had to be fixed as well.

The PR is correct if CI is green again.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?


github-actions bot commented Oct 2, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, allowing seamless certificate updates without requiring manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix resource stuck in Pending state if any clusterSpecList item has 0 members. After the fix, a value of 0 members is handled correctly, similarly to how it's done in the MongoDB resource.

@Julien-Ben Julien-Ben added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Oct 2, 2025
```python
@pytest.mark.e2e_disable_tls_scale_up
def test_tls_is_disabled_and_scaled_up(replica_set: MongoDB):
    replica_set.load()
    replica_set["spec"]["members"] = 5
```

@Julien-Ben Julien-Ben Oct 3, 2025


The issue in the test is that it was doing the update in two steps (scale, then disable TLS), while the whole point is to make both changes at the same time (on top of the duplicate function name).

```go
// Check if TLS is being disabled. If so, we need to lock replicas at the current member count
// to prevent scaling during the TLS disable operation. This decision is made once here and
// applied to both the StatefulSet and OM automation config.
tlsWillBeDisabled, err := checkIfTLSWillBeDisabled(conn, rs, log)
```
Contributor


What if we just block this with validation and say that it's not possible to change the member count and disable TLS at the same time?

Collaborator Author


Yes, after discussing it in DM, let's do that instead of adding complexity to the reconcile loop. This is not a common use case anyway.
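
A minimal sketch of what such a validation could look like (the types and the admission-webhook wiring are assumptions, not the operator's real API): it rejects a spec update that both changes the member count and disables TLS, so the reconcile loop never has to handle the combination.

```go
package main

import (
	"errors"
	"fmt"
)

// MongoDBSpec is an illustrative stand-in for the fields the validation
// would compare between the old and the new object.
type MongoDBSpec struct {
	Members    int
	TLSEnabled bool
}

var errTLSAndScale = errors.New("cannot change member count and disable TLS in the same update")

// validateUpdate rejects the one combination the reconciler cannot handle:
// disabling TLS and changing the member count in a single update.
func validateUpdate(oldSpec, newSpec MongoDBSpec) error {
	tlsDisabled := oldSpec.TLSEnabled && !newSpec.TLSEnabled
	membersChanged := oldSpec.Members != newSpec.Members
	if tlsDisabled && membersChanged {
		return errTLSAndScale
	}
	return nil
}

func main() {
	oldSpec := MongoDBSpec{Members: 3, TLSEnabled: true}
	fmt.Println(validateUpdate(oldSpec, MongoDBSpec{Members: 5, TLSEnabled: false})) // rejected
	fmt.Println(validateUpdate(oldSpec, MongoDBSpec{Members: 3, TLSEnabled: false})) // allowed: TLS-only change
}
```

The user can still reach the combined end state, just in two updates: disable TLS first, then scale.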

Julien-Ben added a commit that referenced this pull request Oct 20, 2025
# CLOUDP-347497 - Single cluster Replica Set Controller Refactoring

## Why this refactoring

The single-cluster RS controller was mixing two concerns:
- **Kubernetes stuff** (StatefulSets, pods, volumes)
- **Ops Manager/MongoDB stuff** (MongoDB processes, replication config)

This worked fine for single-cluster, but it's a problem when you think
about multi-cluster:
- Multi-cluster has **multiple StatefulSets** (one per cluster) but only
**one logical ReplicaSet** in Ops Manager
- The OM automation config doesn't care about how many K8s clusters you
have or how the pods are deployed

So we need to separate these layers properly.

## Main changes

### 1. Broke down the huge Reconcile() method

Before: ~300 lines of inline logic in Reconcile()

Now:
```go
Reconcile()
  ├── reconcileMemberResources()        // Handles all K8s resource creation
  │   ├── reconcileHostnameOverrideConfigMap()
  │   ├── ensureRoles()
  │   └── reconcileStatefulSet()        // StatefulSet-specific logic isolated here
  │       └── buildStatefulSetOptions() // Builds STS configuration
  └── updateOmDeploymentRs()            // Handles Ops Manager automation config updates
```

This makes it way easier to understand what's happening and matches the
multi-cluster controller structure.

### 2. Removed StatefulSet dependency from OM operations

Created new helper functions that work directly with MongoDB resources
instead of StatefulSets:
- `CreateMongodProcessesFromMongoDB()` - was using StatefulSet before
- `BuildFromMongoDBWithReplicas()` - same
- `WaitForRsAgentsToRegisterByResource()` - same

These mirror the existing `...FromStatefulSet` functions but take
MongoDB resources instead.

**Why it matters:** The OM layer now only cares about the MongoDB
resource definition, not how it's deployed in K8s. This makes the code
work the same way for both single-cluster and multi-cluster.
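
To illustrate the idea (with made-up types and field names, not the operator's real API), a helper in the spirit of `CreateMongodProcessesFromMongoDB` can derive everything OM needs from the resource definition alone, with no StatefulSet involved:

```go
package main

import "fmt"

// MongoDB is an illustrative slice of the custom resource: the only inputs
// the OM layer actually needs are the name, namespace and member count.
type MongoDB struct {
	Name      string
	Namespace string
	Members   int
}

// processHostnames builds the per-member hostnames for the OM automation
// config from the resource definition, assuming the usual headless-service
// pattern <name>-<ordinal>.<name>-svc.<namespace>.svc.<domain>.
func processHostnames(mdb MongoDB, clusterDomain string) []string {
	hostnames := make([]string, 0, mdb.Members)
	for i := 0; i < mdb.Members; i++ {
		hostnames = append(hostnames, fmt.Sprintf("%s-%d.%s-svc.%s.svc.%s",
			mdb.Name, i, mdb.Name, mdb.Namespace, clusterDomain))
	}
	return hostnames
}

func main() {
	mdb := MongoDB{Name: "my-rs", Namespace: "mongodb", Members: 3}
	for _, h := range processHostnames(mdb, "cluster.local") {
		fmt.Println(h)
	}
}
```

Because nothing here depends on how the pods are deployed, the same helper shape works whether the members live in one StatefulSet or are spread across several clusters.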

### 3. Added publishAutomationConfigFirstRS checks

A dedicated function for the RS controller instead of the shared one; it does not rely on a StatefulSet object.

## Important for review

The ideal way to review this PR is to compare the new structure to the
`mongodbmultireplicaset_controller.go`. The aim of the refactoring is to
get the single cluster controller closer to it.

Look at:
- `reconcileMemberResources()` in both controllers - similar structure
now
- `updateOmDeploymentRs()` - no more StatefulSet dependency
- New helper functions in `om/process` and `om/replicaset` - mirror
existing patterns

## Bug found along the way

The logic to handle **scale up + disable TLS at the same time** doesn't
actually work properly and should be blocked by validation (see [draft
PR #490](#490) for
more details).

## Tests added

Added tests for the new helper functions:

- `TestCreateMongodProcessesFromMongoDB` - basic scenarios, scaling,
custom domains, TLS, additional config
- `TestBuildFromMongoDBWithReplicas` - integration test checking
ReplicaSet structure and member options propagation
- `TestPublishAutomationConfigFirstRS` - automation config publish logic
with various TLS/auth scenarios
@Julien-Ben
Collaborator Author

Closed in favor of #549

@Julien-Ben Julien-Ben closed this Oct 24, 2025